NB: The worksheet has been developed and prepared by Maxim Romanov for the course “R for Historical Research” (U Vienna, Spring 2019).
The following are the libraries that we will need for this section. Install those that you do not have yet.
#install.packages(c("tidyverse", "readr", "stringr", "text2vec", "tidytext", "wordcloud", "RColorBrewer", "quanteda", "readtext", "igraph"))
# General ones
library(tidyverse)
library(readr)
library(RColorBrewer)
# text analysis specific
library(stringr)
library(text2vec)
library(tidytext)
library(wordcloud)
library(quanteda)
library(readtext)
library(igraph)
d1861 <- read.delim("./dispatch/dispatch_1861.tsv", encoding="UTF-8", header=TRUE, quote="")
Functions are groups of related statements that perform a specific task; they help break a program into smaller, modular chunks. As programs grow larger, functions keep them organized and manageable. Functions also help avoid repetition and make code reusable.
Most programming languages, R included, come with many pre-defined, or built-in, functions. Essentially, every statement that takes arguments in parentheses is a function. For instance, in the code chunk above, read.delim() is a function that takes as its arguments: 1) a filename (or a path to a file); 2) the encoding; 3) whether the file has a header; and 4) the quote parameter (here empty, so " is not treated as a special character). We can also write our own functions, which take care of sets of operations that we tend to repeat again and again.
Later, take a look at this video by one of the key R developers, and check this tutorial.
(From Wikipedia) In geometry, a hypotenuse is the longest side of a right-angled triangle, the side opposite the right angle. The length of the hypotenuse of a right triangle can be found using the Pythagorean theorem, which states that the square of the length of the hypotenuse equals the sum of the squares of the lengths of the other two sides (catheti). For example, if one of the other sides has a length of 3 (when squared, 9) and the other has a length of 4 (when squared, 16), then their squares add up to 25. The length of the hypotenuse is the square root of 25, that is, 5.
Let’s write a function that takes the lengths of the catheti as arguments and returns the length of the hypotenuse:
hypotenuse <- function(cathetus1, cathetus2) {
  hypotenuse <- sqrt(cathetus1*cathetus1 + cathetus2*cathetus2)
  print(paste0("In the triangle with catheti of length ",
               cathetus1, " and ", cathetus2, ", the length of the hypotenuse is ", hypotenuse))
  #return(hypotenuse)
}
hypotenuse(3,4)
## [1] "In the triangle with catheti of length 3 and 4, the length of the hypotenuse is 5"
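Printing is convenient in a worksheet, but returning the value makes it reusable in further calculations. A minimal sketch (the name hypotenuse_v2 is ours, for illustration only):

```r
# variant that returns the value instead of printing it
hypotenuse_v2 <- function(cathetus1, cathetus2) {
  sqrt(cathetus1^2 + cathetus2^2)  # the last evaluated expression is returned
}
hypotenuse_v2(3, 4)
## [1] 5
```

The returned number can now be used in other expressions, for example hypotenuse_v2(3, 4) * 2.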
Let’s say we want to clean up a text so that it is easier to analyze: 1) convert everything to lower case; 2) remove all non-alphanumeric characters; and 3) make sure that there are no multiple spaces:
clean_up_text = function(x) {
x %>%
str_to_lower %>% # make text lower case
str_replace_all("[^[:alnum:]]", " ") %>% # remove non-alphanumeric symbols
str_replace_all("\\s+", " ") # collapse multiple spaces
}
text = "This is a sentence with punctuation, which mentions Vienna, the capital of Austria."
clean_up_text(text)
## [1] "this is a sentence with punctuation which mentions vienna the capital of austria "
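The same cleanup can be sketched in base R, without stringr, using tolower() and gsub(); the name clean_up_text_base is ours, for illustration only:

```r
# base-R equivalent of the cleanup pipeline above
clean_up_text_base <- function(x) {
  x <- tolower(x)                    # make text lower case
  x <- gsub("[^[:alnum:]]", " ", x)  # remove non-alphanumeric symbols
  gsub("\\s+", " ", x)               # collapse multiple spaces
}
clean_up_text_base("Hello, World!")
## [1] "hello world "
```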
Let’s load all issues of the Dispatch from 1862. We can quickly check which types of articles appear in those issues.
library(tidytext)
# quote="" prevents R from treating quotation marks as special characters
d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
d1862 %>%
count(type, sort=T) # to find the most frequent types
We can create subsets of articles based on their types.
articles_d1862 <- d1862 %>%
filter(type=="article")
advert_d1862 <- d1862 %>%
filter(type=="advert")
orders_d1862 <- d1862 %>%
filter(type=="orders")
death_d1862 <- d1862 %>%
filter(type=="death" | type == "died")
married_d1862 <- d1862 %>%
filter(type=="married")
Now, let’s tidy them up: to work with this as a tidy dataset, we need to restructure it into the one-token-per-row format, which, as we saw earlier, is done with the unnest_tokens() function.
test_set <- death_d1862
test_set_tidy <- test_set %>%
mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
select(-type) %>%
unnest_tokens(word, text) %>%
mutate(word_number = row_number())
test_set_tidy
# stop words can remove data that actually should not be removed; if possible, use your own stop word list
data("stop_words")
test_set_tidy_clean <- test_set_tidy %>%
anti_join(stop_words, by="word")
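As the comment above suggests, a domain-specific stop word list is often safer than a generic one. A sketch of the idea with invented words: a one-column tibble named word, removed with the same anti_join():

```r
library(dplyr)
# a hypothetical custom stop word list (invented entries, for illustration)
my_stop_words <- tibble(word = c("mr", "mrs", "esq"))
toy_tokens <- tibble(word = c("mr", "taylor", "auction", "esq"))
toy_tokens %>%
  anti_join(my_stop_words, by = "word")  # keeps "taylor" and "auction"
```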
test_set_tidy_clean
test_set_tidy %>%
anti_join(stop_words, by="word") %>%
count(word, sort = TRUE)
library(wordcloud)
library(RColorBrewer)
test_set_tidy_clean <- test_set_tidy %>%
anti_join(stop_words, by="word") %>%
count(word, sort=T)
set.seed(1234)
wordcloud(words=test_set_tidy_clean$word, freq=test_set_tidy_clean$n,
min.freq = 1, rot.per = .25, random.order=FALSE, #scale=c(5,.5),
max.words=150, colors=brewer.pal(8, "Dark2"))
Your response:
test_set <- death_d1862
test_set_tidy <- test_set %>%
mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
select(-type) %>%
unnest_tokens(word, text) %>%
mutate(word_number = row_number())
test_set_tidy_clean <- test_set_tidy %>%
anti_join(stop_words, by="word") %>%
count(word, sort=T)
set.seed(1234)
wordcloud(words=test_set_tidy_clean$word, freq=test_set_tidy_clean$n,
min.freq = 1, rot.per = .25, random.order=FALSE, #scale=c(5,.5),
max.words=150, colors=brewer.pal(8, "Dark2"))
For more details on generating word clouds in R, see: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know.
Where exactly certain words appear in a text can be of great value for understanding that text. Let’s try to plot something like that for all the issues of the Dispatch from 1862.
d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
test_set <- d1862
test_set$date <- as.Date(test_set$date, format="%Y-%m-%d")
test_set_tidy <- test_set %>%
mutate(item_number = cumsum(str_detect(text, regex("^", ignore_case = TRUE)))) %>%
select(-type) %>%
unnest_tokens(word, text) %>%
mutate(word_number = row_number())
test_set_tidy
ourWord = "donelson"
word_occurance_vector <- which(test_set_tidy$word == ourWord)
plot(0, type='n', #ann=FALSE,
xlim=c(1,length(test_set_tidy$word)), ylim=c(0,1),
main=paste0("Dispersion Plot of `", ourWord, "` in Dispatch (1862)"),
xlab="Newspaper Time", ylab=ourWord, yaxt="n")
segments(x0=word_occurance_vector, x1=word_occurance_vector, y0=0, y1=2) # col=rgb(0,0,0,alpha=0.3) can be passed to segments() to make the lines more transparent
This kind of plot works better with single texts than with newspapers. Let’s take a look at the script of Episode I:
SW_to_DF <- function(path_to_file, episode){
sw_sentences <- scan(path_to_file, what="character", sep="\n")
sw_sentences <- as.character(sw_sentences)
sw_sentences <- gsub("([A-Z]) ([A-Z])", "\\1_\\2", sw_sentences)
sw_sentences <- gsub("([A-Z])-([A-Z])", "\\1_\\2", sw_sentences)
sw_sentences <- as.data.frame(cbind(episode, sw_sentences), stringsAsFactors=FALSE)
colnames(sw_sentences) <- c("episode", "sentences")
return(sw_sentences)
}
sw1_df <- SW_to_DF("./sw_scripts/sw1.md", "sw1")
sw1_df_tidy <- sw1_df %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(sentences, regex("^#", ignore_case = TRUE))))
sw1_df_tidy <- sw1_df_tidy %>%
unnest_tokens(word, sentences)
ourWord = "anakin"
word_occurance_vector <- which(sw1_df_tidy$word == ourWord)
#plot(x=word_occurance_vector, type="h", )
plot(0, type='n', #ann=FALSE,
xlim=c(1,length(sw1_df_tidy$word)), ylim=c(0,1),
main=paste0("Dispersion Plot of `", ourWord, "` in SW1"),
xlab="Movie Time", ylab=ourWord, yaxt="n")
segments(x0=word_occurance_vector, x1=word_occurance_vector, y0=0, y1=2) # col=rgb(0,0,0,alpha=0.3) can be passed to segments() to make the lines more transparent
For newspapers, and other diachronic corpora, a different approach works better:
test_set_tidy_freqDay <- test_set_tidy %>%
anti_join(stop_words, by="word") %>%
group_by(date) %>%
count(word)
test_set_tidy_freqDay
# interesting examples:
# deserters, killed,
# donelson (The Battle of Fort Donelson took place in early February of 1862),
# manassas (place of the Second Bull Run)
# shiloh (Battle of Shiloh took place in April of 1862)
ourWord = "shiloh"
test_set_tidy_word <- test_set_tidy_freqDay %>%
filter(word==ourWord)
plot(x=test_set_tidy_word$date, y=test_set_tidy_word$n, type="l", lty=3, lwd=1,
main=paste0("Word `", ourWord, "` over time"),
xlab = "1862 - Dispatch coverage", ylab = "word frequency per day")
segments(x0=test_set_tidy_word$date, x1=test_set_tidy_word$date, y0=0, y1=test_set_tidy_word$n, lty=1, lwd=2)
library(quanteda)
library(readtext)
dispatch1862 <- readtext("./dispatch/dispatch_1862.tsv", text_field = "text", quote="")
dispatch1862corpus <- corpus(dispatch1862)
pattern= can also take vectors (for example, c("soldier*", "troop*")); you can also search for phrases with pattern=phrase("fort donelson"); window= defines how many words will be shown before and after the match.
kwic_test <- kwic(dispatch1862corpus, pattern = 'lincoln')
kwic_test
Try View(kwic_test) in your console!
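To make the pattern= and window= options concrete, here is a self-contained sketch on an invented one-sentence corpus (recent quanteda versions expect a tokens object in kwic()):

```r
library(quanteda)
# toy corpus with a single invented sentence
toy <- tokens(corpus(c(d1 = "The troops marched to Fort Donelson at dawn")))
# phrase() matches the two-word sequence; window = 2 shows two words of context
kwic(toy, pattern = phrase("fort donelson"), window = 2)
```

Matching is case-insensitive by default, so "fort donelson" finds "Fort Donelson".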
text2vec library
Document similarity, or distance, measures are valuable for a variety of tasks, such as the identification of texts with similar (or identical) content.
prep_fun = function(x) {
x %>%
str_to_lower %>% # make text lower case
str_replace_all("[^[:alnum:]]", " ") %>% # remove non-alphanumeric symbols
str_replace_all("\\s+", " ") # collapse multiple spaces
}
d1862 <- read.delim("./dispatch/dispatch_1862.tsv", encoding="UTF-8", header=TRUE, quote="", stringsAsFactors = FALSE)
Let’s filter it down to a sample that will not take too much time to process. We also need to clean up our texts for better calculations.
sample_d1862 <- d1862 %>%
filter(type=="advert")
sample_d1862$text <- prep_fun(sample_d1862$text)
# shared vector space
it = itoken(as.vector(sample_d1862$text))
v = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 2)
vectorizer = vocab_vectorizer(v)
prune_vocabulary() is a useful function if you work with a large corpus; term_count_min= allows us to remove low-frequency vocabulary from our vector space and lighten the calculations.
Now, we need to create a document-feature matrix:
dtm = create_dtm(it, vectorizer)
The text2vec library can calculate several different kinds of distances (details: http://text2vec.org/similarity.html). We start with the Jaccard similarity:
jaccardMatrix = sim2(dtm, dtm, method = "jaccard", norm = "none")
jaccardMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
jaccardMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)
Let’s take a look at a small section of our matrix. Can you read it? How should this data look in tidy format?
jaccardMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
## 1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244 1.00000000 0.05063291
## 1862-04-22_advert_245 0.05063291 1.00000000
## 1862-04-16_advert_6 0.10588235 0.11764706
## 1862-04-16_advert_7 0.06578947 0.07317073
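As a sanity check, the Jaccard similarity of two documents is the size of the intersection of their word sets divided by the size of the union. A base-R sketch with two invented word sets:

```r
# Jaccard similarity: |intersection| / |union| of the word sets
a <- c("desirable", "house", "lot", "auction")
b <- c("desirable", "house", "street", "auction")
length(intersect(a, b)) / length(union(a, b))
## [1] 0.6
```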
Converting a matrix into a proper tidy data frame is a bit tricky. Luckily, the igraph library can be extremely helpful here. We can treat our matrix as a set of edges, where each number is the weight of a given edge. Loading this data into igraph saves us the heavy lifting of conversion, as it can do all the complicated reconfiguration of our data and turn it into a data frame that conforms to the principles of tidy data.
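On a toy 3-by-3 matrix (values invented), the conversion looks like this:

```r
library(igraph)
# a small symmetric similarity matrix for three hypothetical texts
m <- matrix(c(1.0, 0.2, 0.5,
              0.2, 1.0, 0.0,
              0.5, 0.0, 1.0),
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("t1", "t2", "t3")))
g <- graph.adjacency(m, mode = "undirected", weighted = TRUE)
g <- simplify(g)                  # drops the self-loops from the diagonal
as_data_frame(g, what = "edges")  # one row per pair: from, to, weight
```

Zero entries produce no edge, so only the t1/t2 and t1/t3 pairs remain.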
All steps include: 1) converting our regular matrix into an igraph object; 2) simplifying the graph; and 3) extracting the edges, with their weights, as a data frame.
jaccardMatrix <- as.matrix(jaccardMatrix)
library(igraph)
jaccardNW <- graph.adjacency(jaccardMatrix, mode="undirected", weighted=TRUE)
jaccardNW <- simplify(jaccardNW)
jaccard_sim_df <- as_data_frame(jaccardNW, what="edges")
colnames(jaccard_sim_df) <- c("text1", "text2", "jaccardSimilarity")
t_jaccard_sim_df_subset <- jaccard_sim_df %>%
filter(jaccardSimilarity > 0.49) %>%
arrange(desc(jaccardSimilarity), .by_group=T)
t_jaccard_sim_df_subset
Let’s check the texts of 1862-09-08_advert_171 and 1862-04-07_advert_207, two adverts with a high similarity score.
example <- d1862 %>%
filter(id=="1862-09-08_advert_171")
paste(example[5])## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
filter(id=="1862-04-07_advert_207")
paste(example[5])## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."
Now calculate the cosine and euclidean distances for the same set of texts. What is the score for the same two texts? How do these scores differ, in your opinion? (Take a look at a few examples with a maximum match!)
cosineMatrix = sim2(dtm, dtm, method = "cosine", norm = "none")
cosineMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
cosineMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)
Let’s take a look at a small section of our matrix. Can you read it? How should this data look in tidy format?
cosineMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
## 1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244 143 20
## 1862-04-22_advert_245 20 32
## 1862-04-16_advert_6 38 14
## 1862-04-16_advert_7 16 6
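The cosine score itself is the dot product of two count vectors divided by the product of their lengths (with norm = "none", sim2() returns plain dot products, which is why the diagonal above is not 1). A base-R sketch with toy vectors:

```r
# cosine similarity: dot product divided by the product of vector lengths
a <- c(1, 0, 1)
b <- c(1, 1, 0)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
## [1] 0.5
```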
Converting a matrix into a proper tidy data frame is a bit tricky. Luckily, the igraph library can be extremely helpful here. We can treat our matrix as a set of edges, where each number is the weight of a given edge. Loading this data into igraph saves us the heavy lifting of conversion, as it can do all the complicated reconfiguration of our data and turn it into a data frame that conforms to the principles of tidy data.
All steps include: 1) converting our regular matrix into an igraph object; 2) simplifying the graph; and 3) extracting the edges, with their weights, as a data frame.
cosineMatrix <- as.matrix(cosineMatrix)
library(igraph)
cosineNW <- graph.adjacency(cosineMatrix, mode="undirected", weighted=TRUE)
cosineNW <- simplify(cosineNW)
cosine_sim_df <- as_data_frame(cosineNW, what="edges")
colnames(cosine_sim_df) <- c("text1", "text2", "cosineSimilarity")
t_cosine_sim_df_subset <- cosine_sim_df %>%
filter(cosineSimilarity > 0.49) %>%
arrange(desc(cosineSimilarity), .by_group=T)
t_cosine_sim_df_subset
Let’s check the texts of 1862-09-08_advert_171 and 1862-04-07_advert_207, two adverts with a high similarity score.
example <- d1862 %>%
filter(id=="1862-09-08_advert_171")
paste(example[5])## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
filter(id=="1862-04-07_advert_207")
paste(example[5])## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."
# NB: sim2() supports the "cosine" and "jaccard" similarities; euclidean distance proper can be computed with dist2() (see http://text2vec.org/similarity.html)
euclidMatrix = sim2(dtm, dtm, method = "cosine", norm = "none")
euclidMatrix@Dimnames[[1]] <- as.vector(sample_d1862$id)
euclidMatrix@Dimnames[[2]] <- as.vector(sample_d1862$id)
Let’s take a look at a small section of our matrix. Can you read it? How should this data look in tidy format?
euclidMatrix[1:4, 1:2]
## 4 x 2 sparse Matrix of class "dgCMatrix"
## 1862-04-22_advert_244 1862-04-22_advert_245
## 1862-04-22_advert_244 143 20
## 1862-04-22_advert_245 20 32
## 1862-04-16_advert_6 38 14
## 1862-04-16_advert_7 16 6
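Euclidean distance, by contrast, is the square root of the summed squared differences between two vectors; here smaller values mean more similar texts, and identical vectors score 0. A base-R sketch:

```r
# euclidean distance: sqrt of the sum of squared differences
a <- c(1, 0, 1)
b <- c(1, 1, 0)
sqrt(sum((a - b)^2))
## [1] 1.414214
```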
Converting a matrix into a proper tidy data frame is a bit tricky. Luckily, the igraph library can be extremely helpful here. We can treat our matrix as a set of edges, where each number is the weight of a given edge. Loading this data into igraph saves us the heavy lifting of conversion, as it can do all the complicated reconfiguration of our data and turn it into a data frame that conforms to the principles of tidy data.
All steps include: 1) converting our regular matrix into an igraph object; 2) simplifying the graph; and 3) extracting the edges, with their weights, as a data frame.
euclidMatrix <- as.matrix(euclidMatrix)
library(igraph)
euclidNW <- graph.adjacency(euclidMatrix, mode="undirected", weighted=TRUE)
euclidNW <- simplify(euclidNW)
euclid_sim_df <- as_data_frame(euclidNW, what="edges")
colnames(euclid_sim_df) <- c("text1", "text2", "euclidSimilarity")
t_euclid_sim_df_subset <- euclid_sim_df %>%
filter(euclidSimilarity > 0.49) %>%
arrange(desc(euclidSimilarity), .by_group=T)
t_euclid_sim_df_subset
Let’s check the texts of 1862-09-08_advert_171 and 1862-04-07_advert_207, two adverts with a high similarity score.
example <- d1862 %>%
filter(id=="1862-09-08_advert_171")
paste(example[5])## [1] "A desirable Framed House and Lot on Mill street, in Adams's Valley, at Auction. -- We will sell upon the premises, on Thursday, the 11th day of September, commencing at 4½ o'clock P. M., a very desirable Framed House and Lot located on Mill street, in Adams's Valley, and near to the new workshops of the Virginia Central Railroad Company. It contains four good rooms, and is particularly adapted for a small sized family. The lot fronts 40 feet and runs back 200 feet. Terms. -- One-third cash; the balance at 4 and 8 months, for negotiable notes, with interest added, and secured by a trust deed. se 5 Jas. M. Taylor & Son, Auct'rs."
example <- d1862 %>%
filter(id=="1862-04-07_advert_207")
paste(example[5])## [1] "Desirable Framed House and Lot on Federal street, in Sidney, at Auction. -- We will sell upon the premises, on Tuesday, the 8th day of April, commencing at 4 ½ o'clock P. M., the desirable framed House, on the north side of Elmwood, near to Grove street, now in the occupancy of A. B. Hall. It contains several good rooms, Kitchen, & c., and adapted for a small-sized family; a well of water in the yard. The Lot fronts 30 feet, and runs back 180 feet to an alley 20 feet wide. Terms -- One-third cash; the balance at 6 and 12 months, for negotiable notes, with interest added, and secured by a trust deed. The purchaser to pay the taxes for 1862. James M. Taylor & Son, Auctioneers."
Work through Chapter 9 of Arnold, Taylor, and Lauren Tilton. 2015. Humanities Data in R. New York, NY: Springer Science+Business Media (on Moodle!): create a notebook with all the code discussed there and send it via email (share it via Dropbox or some other service if it is too large).